Abstract

A family of gossiping algorithms depending on a parameter permutation is
introduced, formalized, and discussed. Several of its members are analyzed and
their asymptotic behaviour is revealed, including a member whose model and
performance closely follow those of hardware pipelined processors. This
similarity is exposed. An optimizing algorithm is finally proposed and discussed
as a general strategy to increase the performance of the base algorithms.


1 Introduction 


A number of distributed applications, e.g., distributed consensus [15] or
those based on the concept of restoring organs [4,12] (N-modular redundancy
systems with N-replicated voters, for instance the distributed voting tool
described in [3]), require a base service called gossiping [8,1,5].

Informally speaking, gossiping is a communication procedure such that every
member of a set has to communicate a private value to all the other members.
Gossiping is clearly an expensive service, as it requires a large amount
of communication. Implementations of this service can have a great impact
on the throughput of their client applications and perform very differently
depending on the number of members in the set. This work describes a family of
gossiping algorithms that depend on a combinatorial parameter. Three cases
are then analyzed under the hypotheses of discrete time, of constant time for
performing a send or receive, and of a crossbar communication system. It is
shown how, depending on the pattern of the parameter, gossiping can use from
$O(N^2)$ to $O(N)$ time, N being the number of communicating members. The
last and best-performing case, whose activity follows the execution pattern of
pipelined hardware processors, is shown to exhibit an efficiency that is constant
with respect to N. This translates into unlimited scalability of the corresponding
gossiping service. When performing multiple consecutive gossiping sessions, the
throughput of the system can reach the value of t/2, t being the time for
sending one value from one member to another, i.e., a full gossiping is completed
every two basic communication steps.

The structure of the paper is as follows: first, in Sect. 2, a formal model for the
family of algorithms is provided. The following three sections (Sect. 3, Sect. 4,
and Sect. 5) introduce, analyze, and discuss three members of the family,
showing in particular that one of them, whose behaviour resembles that of
pipelined hardware microprocessors, uses $O(N)$ time, N being the number of
employed nodes. An optimizing algorithm is then introduced in Sect. 6.
Section 7 describes two applications of our algorithms. Finally, Sect. 8 summarizes
our contributions and draws a number of conclusions.


2 A Formal Model 


Definition 1 (system) Let N > 0. N + 1 processors are interconnected via
some communication means that allows them to communicate with each other
(for instance, by means of full-duplex point-to-point communication lines).
Communication is synchronous and blocking. Processors are uniquely identified
by integer labels in {0, ..., N}; they will be globally referred to, together
with the communication means, as “the system”.

Definition 2 (problem) The processors own some local data they need to
share (for instance, to execute a voting algorithm [12]). In order to share their
local data, each processor needs to broadcast its own data to all others, via
multiple sending operations, and to receive the N data items owned by its fellows.
This must be done as soon as possible. We assume a discrete time model:
events occur at discrete time steps, one event at a time per processor. This is
a special class of the general family of problems of information dissemination
known as gossiping [8,1,5]. We will refer to this class as “the problem”.

Definition 3 (time step) We assume that the time needed to send a message and
the time needed to receive one are constant. We call this amount of time a
“time step”.

Definition 4 (actions) On a given time step t, processor i may be:

(1) sending a message to processor j, j ≠ i; this is represented in form of
relation as $i\,S^t\,j$;

(2) receiving a message from processor j, j ≠ i; this is represented as $i\,R^t\,j$;

(3) blocked, waiting for messages to be received from any processor; where
both the identities of the involved processors and t can be omitted without
ambiguity, symbol “-” will be used to represent this case;

(4) blocked, waiting for a message to be sent, i.e., for a designee to enter the
receiving state; under the same assumptions of case (3), symbol “↷” will
be used.

The above cases are referred to as “the actions” of a time step.

Definition 5 (slot, used slot, wasted slot) A slot is a temporal “window”
one time step long, related to a processor. On each given time step there are
N + 1 available slots within the system. Within that time step, a processor
may use that slot (if it sends or receives a message during that slot), or it
may waste it (if it is in one of the remaining two cases). In other words:

Processor i makes use of slot t (represented by predicate U(t, i)) if and only if

$$U(t, i) = \text{“}\exists j \, (i\,S^t\,j \lor i\,R^t\,j)\text{”}$$

is true; on the contrary, processor i is said to waste slot t iff $\lnot U(t, i)$.

The following notation,

$$\delta_{i,t} = \begin{cases} 1 & \text{if } U(t, i) \text{ is true,} \\ 0 & \text{otherwise,} \end{cases}$$

will be used to count used slots.

Definition 6 (states WR, WS, S, R) Let us define four state templates for a
finite state automaton (FSA) to be described later on.

WR state. A processor is in state $WR_j$ if it is waiting for the arrival of
a message from processor j. Where the subscript is not important it will
be omitted. Once there, a processor stays in state WR for zero (if it can
start receiving immediately) or more time steps, corresponding to the same
number of actions “wait for a message to come.”

S state. A processor is in state $S_j$ when it is sending a message to addressee
processor j. Note that by the above assumptions and definitions this
transition lasts exactly one time step. To each transition to the S state there
corresponds exactly one “send” action.

WS state. A processor which is willing to send a message to processor j is
said to be in state $WS_j$. Where the subscript is not important it will be
omitted. The permanence of a processor in state WS implies zero (if the
processor can send immediately) or more occurrences in a row of the “wait
for sending” action.

R state. A processor which is receiving a message from processor j is said to
be in state $R_j$. By the above definitions, this state transition also lasts one
time step.

Let $\mathcal{P}_1, \ldots, \mathcal{P}_N$ represent a permutation of the N integers
0, ..., i − 1, i + 1, ..., N. Then the above state templates can be used to
compose N + 1 finite state automata making use of the following algorithm
(i ∈ {0, ..., N}):

Algorithm 1: Compose the FSA which solves the problem of Def. 2 for
processor i

Input: A = (i, N, P)

Output: FSA(A)

begin
    FSA(A) := START                      { emit the initial state }
    for j := 0 to i − 1 do
        { operator "^" pushes a state on top of a FSA }
        FSA(A) := FSA(A) ^ WR
        FSA(A) := FSA(A) ^ R
    enddo
    for j := 1 to N do
        FSA(A) := FSA(A) ^ WS_{P_j}
        FSA(A) := FSA(A) ^ S_{P_j}
    enddo
    for j := i + 1 to N do
        FSA(A) := FSA(A) ^ WR
        FSA(A) := FSA(A) ^ R
    enddo
    FSA(A) := FSA(A) ^ STOP              { emit the final state }
end.

Figure 1 for instance shows the state diagram of the FSA to be executed by
processor i. The first row represents the condition that has to be reached before
processor i is allowed to begin its broadcast: a series of i couples (WR, R).

Once processor i has successfully received i messages, it gains the right to
broadcast, which it does according to the rule expressed in the second row of
Fig. 1: it orderly sends its message to its fellows, the j-th message being sent
to processor $\mathcal{P}_j$.

The third row of Fig. 1 represents the reception of the remaining N − i
messages, coded as N − i couples like those in the first row.
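As an illustration, the composition rule of Algorithm 1 may be sketched in Python as follows. The function name `compose_fsa` and the textual state names are assumptions made for this example only; the sketch simply lists the states in the order in which they are pushed.

```python
from typing import List, Sequence


def compose_fsa(i: int, n: int, perm: Sequence[int]) -> List[str]:
    """Return the ordered list of states of the FSA run by processor i.

    `perm` plays the role of the permutation (P_1, ..., P_N) of the labels
    0..N with i left out. State names are illustrative (hypothetical).
    """
    expected = sorted(set(range(n + 1)) - {i})
    if sorted(perm) != expected:
        raise ValueError("perm must be a permutation of 0..N without i")

    states = ["START"]
    # First row of Fig. 1: i couples (WR, R) before the right to broadcast.
    for _ in range(i):
        states += ["WR", "R"]
    # Second row: the j-th message is sent to processor P_j.
    for dest in perm:
        states += [f"WS_{dest}", f"S_{dest}"]
    # Third row: the remaining N - i receptions.
    for _ in range(n - i):
        states += ["WR", "R"]
    states.append("STOP")
    return states


if __name__ == "__main__":
    print(compose_fsa(1, 3, [0, 2, 3]))
```

Whatever the permutation, the resulting automaton has 4N + 2 states: START, STOP, N couples (WS, S), and N couples (WR, R) split before and after the broadcast.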

We experimentally observed that, regardless of the value of $\mathcal{P}$, such FSAs
represent a distributed algorithm which solves the problem of Definition 2 without
deadlocks. As intuition may suggest, the choice of which permutation to use
has indeed a deep impact on the overall performance of the algorithm, together
with the physical characteristics of the communication line^1. Reporting on
this impact is one of the aims of this paper.

Fig. 1. The state diagram of the FSA run by processor i. The first row consists
of i couples (WR, R). $(\mathcal{P}_1, \ldots, \mathcal{P}_N)$ represents a permutation of the N integers
0, ..., i − 1, i + 1, ..., N. The last row contains N − i couples (WR, R).

To this end, let us furthermore define: 

Definition 7 (run) The collection of slots needed to fully execute the above
algorithm on a given system, together with the value of the corresponding
actions.

Definition 8 (average slot utilization) The average number of slots used
during a time step. It represents the average degree of parallelism exploited in
the system. It will be indicated as $\mu_N$, or simply as $\mu$. It varies between 0 and
N + 1.

Definition 9 (efficiency) The percentage of used slots over the total number
of slots available during a run. $\varepsilon_N$, or more simply $\varepsilon$, will be used to represent
efficiency.

Definition 10 (length) The number of time steps in a run. It represents a
measure of the time needed by the distributed algorithm to complete. $\lambda_N$, or
more simply $\lambda$, will be used for lengths.

Definition 11 (number of slots) $\sigma(N) = (N + 1)\lambda_N$ represents the number
of slots available within a run of N + 1 processors.

^1 For instance, in case of a bus, an ALOHA system (see e.g., [23]), or other shared
medium systems, a number of used slots greater than 2 implies a collision, i.e., a
penalty that wastes the current slot; using transputers [6], each of which has four
independent communication channels, used slots cannot be more than 8; while in
a fully interconnected end-to-end system, that figure can grow up to its maximum
value, $2\lfloor (N + 1)/2 \rfloor$, without any problem.

Definition 12 (number of used slots) For each run and each time step t,

$$\nu_t = \sum_{i=0}^{N} \delta_{i,t}$$

represents the number of slots that have been used during t.

Definition 13 (utilization string) The $\lambda$-tuple

$$\vec{\nu} = [\nu_1, \nu_2, \ldots, \nu_\lambda],$$

orderly representing the number of used slots for each time step, is called
utilization string.

In the next Sections, we introduce and discuss three cases of $\mathcal{P}$. We will show
how varying the structure of $\mathcal{P}$ may lead to extremely different values of $\mu$, $\varepsilon$,
and $\lambda$. This fact, coupled with physical constraints pertaining to the
communication line and with the number of available independent channels, determines
the overall performance of this algorithm.

In the following we assume the availability of a fully connected (crossbar)
interconnection [17] that allows any processor to communicate with any other
processor in one time step.
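The quantities of Definitions 8 to 11 can all be derived from a utilization string. The following Python sketch shows one way to do so; the function name `run_metrics` is an assumption of this example, and the sample string is the one reported with Table 4 (N = 9).

```python
from fractions import Fraction
from typing import Sequence, Tuple


def run_metrics(nu: Sequence[int], n: int) -> Tuple[int, Fraction, Fraction]:
    """Return (length, average slot utilization, efficiency) of a run.

    `nu` is the utilization string (Def. 13) and `n` is N, so the system
    has n + 1 processors.
    """
    length = len(nu)                 # lambda: number of time steps (Def. 10)
    used = sum(nu)                   # used slots over the whole run
    total = (n + 1) * length         # sigma(N): available slots (Def. 11)
    mu = Fraction(used, length)      # average used slots per step (Def. 8)
    eps = Fraction(used, total)      # fraction of used slots (Def. 9)
    return length, mu, eps


if __name__ == "__main__":
    # Utilization string reported with Table 4 (N = 9).
    nu = [2, 2, 4, 4, 6, 6, 8, 8] + [10, 8] * 5 + [10] + [8, 8, 6, 6, 4, 4, 2, 2]
    length, mu, eps = run_metrics(nu, 9)
    print(length, float(mu), float(eps))
```

Exact fractions are used so that the values can be compared with the closed forms of the following Sections without rounding issues.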


3 First Case: Identity Permutation 


As a first case, let us assume that the structure of $\mathcal{P}$ is fixed. For instance,
let $\mathcal{P}$ be equal to the identity permutation:

$$\begin{pmatrix} 0, & \ldots, & i-1, & i+1, & \ldots, & N \\ 0, & \ldots, & i-1, & i+1, & \ldots, & N \end{pmatrix},$$

i.e., in cycle notation [13], (0) ... (i − 1)(i + 1) ... (N).

This means that, once processor i gains the right to broadcast, it will first try
to send its message to processor 0 (possibly having to wait for it to become
available to receive that message), then it will do the same with processor 1,
and so forth up to N, obviously skipping itself. This is effectively represented
in Table 1 for N = 4. Let us call this a run-table.
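A run such as the one just described can be reproduced with a small simulator. The Python sketch below is a minimal one, written under the assumptions of the model (crossbar, one message per time step, a waiting receiver accepts a message from any sender). The names `simulate` and `identity_order`, and the tie-break by which the lowest label wins when two senders address the same receiver, are assumptions of this example.

```python
from typing import Callable, List, Sequence, Tuple

Perm = Callable[[int, int], Sequence[int]]


def identity_order(i: int, n: int) -> List[int]:
    """Send order 0, 1, ..., N with i skipped (identity permutation)."""
    return [j for j in range(n + 1) if j != i]


def simulate(n: int, order: Perm) -> Tuple[int, List[int]]:
    """Simulate one run; return (length, utilization string).

    Assumption of this sketch: when several senders target the same
    receiver in a step, the lowest label wins (hypothetical tie-break).
    """
    procs = range(n + 1)
    targets = {i: list(order(i, n)) for i in procs}   # destinations still to serve
    received = {i: 0 for i in procs}                   # messages received so far
    nu: List[int] = []

    def sending(i: int) -> bool:
        # Processor i broadcasts once it holds i messages and has targets left.
        return received[i] >= i and bool(targets[i])

    def receiving(i: int) -> bool:
        return not sending(i) and received[i] < n

    while any(targets[i] or received[i] < n for i in procs):
        # The whole step is decided from the state at its beginning.
        senders = [i for i in procs if sending(i)]
        listeners = {i for i in procs if receiving(i)}
        matched = []
        for s in senders:                  # ascending labels: lowest wins
            dest = targets[s][0]
            if dest in listeners:
                listeners.remove(dest)     # one reception per slot
                matched.append((s, dest))
        if not matched:
            raise RuntimeError("no progress: the run is deadlocked")
        for s, dest in matched:
            targets[s].pop(0)
            received[dest] += 1
        nu.append(2 * len(matched))        # one send slot plus one receive slot
    return len(nu), nu


if __name__ == "__main__":
    length, nu = simulate(4, identity_order)
    print(length, nu)
```

Passing a different `order` function lets the same loop be reused for the other permutations discussed in this paper.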

Table 1
A run (N = 4), with $\mathcal{P}$ equal to the identity permutation. The step row represents
time steps. Id's identify processors, $\vec{\nu}$ is the utilization string (see Def. 13). In this
case $\mu$, or the average utilization, is 2.22 slots out of 5, with an efficiency $\varepsilon$ = 44.44%
and a length $\lambda$ = 18. Note that, if the slot is used, then entry (i, t) = $R_j$ of this
matrix represents relation $i\,R^t\,j$.

It is possible to characterize precisely the duration of the algorithm adopting
this permutation:

Proposition 14 $\lambda_N = \frac{3}{4}N^2 + \frac{5}{4}N + \frac{1}{2}\lfloor N/2 \rfloor$.


PROOF (by induction) Let us consider run-table N + 1. Let us strip off its
last row; then wipe out the $\lfloor (N + 1)/2 \rfloor - 1$ leftmost columns which contain
element $S_{N+1}$. Let us also cut out the whole right part of the table starting at
the column containing the last occurrence of $S_{N+1}$. Finally, let us rename all
the remaining $S_{N+1}$'s as “-”.

Our first goal is showing that what remains is run-table N. To this end, let us
first point out how the only actions that affect the content of other cells in a
run-table are the S actions. Their range of action is given by their subscript:
an $S_{N+1}$ for instance only affects an entry in row N + 1.

Now consider what happens when processor i − 1 sends its message to processor
i and this latter gains the right to broadcast as well: at this point, processor i
starts sending to processors in the range {0, ..., i − 2}, i.e., those “above”; as
soon as it tries to reach processor i − 1, in the case this latter has not finished
its own broadcast, i enters state WS and blocks.

This means that:

(1) processors “below” processor i will not be allowed to start their broadcast,
and

(2) for processor i and those “above”, $\mu$, or the degree of parallelism, is always
equal to 2 or 4; no other value is possible. This is shown for instance in
Table 1, row “$\vec{\nu}$”.

As depicted in Fig. 2, processor i gets blocked only if it tries to send to
processor i − 1 while this latter is still broadcasting, which happens when
i < (N + 1)/2; this condition is true for any processor
j ∈ {1, ..., ⌈(N + 1)/2⌉ − 1}. Note
how a “cluster” appears, consisting of j − 1 columns with 4 used slots inside
(Table 2 can be used to verify the above when N is 7). Removing the first
occurrences of $S_{N+1}$ (from row 0 to row $\lfloor (N + 1)/2 \rfloor - 1$) therefore simply
shortens by one time step the stay of each processor in their current waiting
states. All remnant columns containing that element cannot be removed:
these occurrences simply vanish by substituting them with a “-” action.

Fig. 2. Processor i − 1 blocks processor i only if 2i − 1 < N. A transmission, i.e.,
two used slots, is represented by an arrow. In dotted arrows the sender is processor
i − 1, for normal arrows it is processor i. Note the cluster of i − 1 columns with two
concurrent transmissions (adding up to 4 used slots) in each of them.

Finally, the removal of the last occurrence of $S_{N+1}$ from the series of sending
actions which constitute the broadcast of processor N allows the removal of
the whole right sub-table starting at that point. The obtained table contains
all and only the actions of run N; the coherence of these actions is not affected;
and all broadcast sessions are managed according to the rule of the identity
permutation. In other words, this is run-table N.

Now let us consider $\sigma(N + 1)$: according to the above argument, this is equal
to:

(1) the number of slots available in an N-run, i.e., $\sigma(N)$,

(2) plus N + 1 slots from each of the columns that witness a delay, i.e.,
$\lfloor (N + 1)/2 \rfloor \cdot (N + 1)$,

(3) plus the slots in the right sub-matrix, not counting the last row, i.e.,
$(N + 1)(N + 2)$,

(4) plus an additional row.

Table 2
Run-table 7 for $\mathcal{P}$ equal to the identity permutation. Average utilization is 2.38
slots out of 8, or an efficiency of 29.79%.

In other words, $\sigma(N + 1)$ can be expressed as the sum of the above first three
items multiplied by a factor equal to $(N + 2)/(N + 1)$. This can be written as an
equation as

$$\sigma(k + 1) = \left( \sigma(k) + \left\lfloor \frac{k + 1}{2} \right\rfloor (k + 1) + (k + 1)(k + 2) \right) \frac{k + 2}{k + 1}. \qquad (3)$$

By Definition 11, this leads to the following recursive relation:

$$\lambda(k + 1) = \lambda(k) + \left\lfloor \frac{k + 1}{2} \right\rfloor + k + 2. \qquad (4)$$

Furthermore, the following is true by the induction hypothesis:

$$\lambda_N = \lambda(N) = \frac{3}{4}N^2 + \frac{5}{4}N + \frac{1}{2}\lfloor N/2 \rfloor. \qquad (5)$$

Our goal is to show that Eq. (4) and Eq. (5) together imply that $\lambda(N + 1) = \lambda_{N+1}$,
this latter being

$$\lambda_{N+1} = \frac{3}{4}(N + 1)^2 + \frac{5}{4}(N + 1) + \frac{1}{2}\left\lfloor \frac{N + 1}{2} \right\rfloor \qquad (6)$$

$$\phantom{\lambda_{N+1}} = \frac{3}{4}N^2 + \frac{11}{4}N + 2 + \frac{1}{2}\left\lfloor \frac{N + 1}{2} \right\rfloor. \qquad (7)$$

Now, let us suppose N is even; this implies that $\lfloor (N + 1)/2 \rfloor = \lfloor N/2 \rfloor = N/2$.
Exploiting this in Eq. (4) for k = N and in Eq. (5), and substituting the latter
in the former Equation, brings us to the following result:

$$\begin{aligned}
\lambda(N + 1) &= \lambda(N) + \left\lfloor \frac{N + 1}{2} \right\rfloor + N + 2 \\
&= \lambda_N + \frac{N}{2} + N + 2 \\
&= \frac{3}{4}N^2 + \frac{5}{4}N + \frac{N}{4} + \frac{N}{2} + N + 2 \\
&= \frac{3}{4}N^2 + 3N + 2 \\
&= \frac{3}{4}N^2 + \frac{11}{4}N + 2 + \frac{1}{2} \cdot \frac{N}{2},
\end{aligned}$$

which is equal to $\lambda_{N+1}$ because of Eq. (7). On the other hand, if N > 0 is
odd, then $\lfloor (N + 1)/2 \rfloor = (N + 1)/2$, while $\lfloor N/2 \rfloor = (N - 1)/2$. With the same
approach as above we get:

$$\begin{aligned}
\lambda(N + 1) &= \lambda(N) + \left\lfloor \frac{N + 1}{2} \right\rfloor + N + 2 \\
&= \frac{3}{4}N^2 + \frac{5}{4}N + \frac{1}{2}\lfloor N/2 \rfloor + (N + 1)/2 + N + 2 \\
&= \frac{3}{4}N^2 + \frac{5}{4}N + (N - 1)/4 + (N + 1)/2 + N + 2 \\
&= \frac{3}{4}N^2 + \frac{3}{2}N + \frac{3}{4} + \frac{5}{4}N + \frac{5}{4} + (N + 1)/4,
\end{aligned}$$

which is again equal to $\lambda_{N+1}$ because of Eq. (6). □
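The induction above can also be checked numerically. The following Python sketch unrolls recurrence (4) from the base case λ(1) = 2 (two processors exchanging one message each) and compares the result with the closed form of Prop. 14; both function names are assumptions of this example.

```python
from fractions import Fraction


def length_closed_form(n: int) -> Fraction:
    """lambda_N as stated by Prop. 14 (identity permutation)."""
    return Fraction(3, 4) * n * n + Fraction(5, 4) * n + Fraction(n // 2, 2)


def length_by_recurrence(n: int) -> int:
    """lambda(N) obtained by unrolling Eq. (4) from lambda(1) = 2."""
    lam = 2                                # two processors, two time steps
    for k in range(1, n):
        lam += (k + 1) // 2 + k + 2        # Eq. (4)
    return lam


if __name__ == "__main__":
    for n in (4, 7, 20):
        print(n, length_closed_form(n), length_by_recurrence(n))
```

For N = 4 and N = 7 both functions give the lengths reported with Table 1 and implied by Table 2, respectively.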


Lemma 15 The number of columns with 4 used slots inside, for a run with
$\mathcal{P}$ equal to the identity permutation and N + 1 processors, is

$$\sum_{i=0}^{N-1} \left\lfloor \frac{i}{2} \right\rfloor. \qquad (8)$$

PROOF. Figure 2 shows also how, for any processor $1 \le i \le \lfloor (N + 1)/2 \rfloor$,
there exists only one cluster of i − 1 columns such that each column
contains exactly 4 used slots. Moreover Fig. 3 shows that, for any processor
$\lfloor (N + 1)/2 \rfloor + 1 \le i \le N$, there exists only one cluster of N − i columns
with that same property.


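Fig. 3. For any processor $i > \lfloor (N + 1)/2 \rfloor$, there exists only one cluster of N − i
columns with 4 used slots inside.

Let us call $u_4$ this number and count such columns:

$$u_4 = \sum_{i=1}^{\lfloor (N+1)/2 \rfloor} (i - 1) + \sum_{i=\lfloor (N+1)/2 \rfloor + 1}^{N} (N - i). \qquad (9)$$

Via two well-known algebraic transformations on sums (see e.g., [7,18]) we
get to

$$u_4 = \sum_{j=0}^{\lfloor (N+1)/2 \rfloor - 1} j + \sum_{j=0}^{N - \lfloor (N+1)/2 \rfloor - 1} \left( N - \left\lfloor \frac{N + 1}{2} \right\rfloor - 1 - j \right). \qquad (10)$$

Now, if N is even, then

$$\begin{aligned}
u_4 &= \sum_{j=0}^{N/2-1} j + \sum_{j=0}^{N/2-1} \left( \frac{N}{2} - 1 - j \right) \\
&= \sum_{j=0}^{N/2-1} \left( \frac{N}{2} - 1 \right) \\
&= \frac{N}{2} \left( \frac{N}{2} - 1 \right) \\
&= 2 \sum_{i=0}^{N/2-1} i \\
&= \sum_{i=0}^{N-1} \left\lfloor \frac{i}{2} \right\rfloor,
\end{aligned} \qquad (11)$$

while, if N is odd,

$$\begin{aligned}
u_4 &= \sum_{j=0}^{(N-1)/2} j + \sum_{j=0}^{(N-3)/2} \left( \frac{N - 3}{2} - j \right) \\
&= \frac{N - 1}{2} + \sum_{j=0}^{(N-3)/2} \frac{N - 3}{2} \\
&= \frac{N - 1}{2} + \frac{N - 1}{2} \cdot \frac{N - 3}{2} \\
&= \frac{N - 1}{2} + 2 \sum_{i=0}^{(N-3)/2} i \\
&= \sum_{i=0}^{N-1} \left\lfloor \frac{i}{2} \right\rfloor.
\end{aligned} \qquad (12)$$

□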
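The counting argument of Lemma 15 lends itself to a quick numerical check. In the Python sketch below, whose function names are assumptions of this example, the first function adds up the cluster sizes processor by processor as in Eq. (9), while the second evaluates expression (8) directly.

```python
def clusters_by_processor(n: int) -> int:
    """Right-hand side of Eq. (9): cluster sizes summed processor by processor."""
    half = (n + 1) // 2
    upper = sum(i - 1 for i in range(1, half + 1))       # clusters of i - 1 columns
    lower = sum(n - i for i in range(half + 1, n + 1))   # clusters of N - i columns
    return upper + lower


def clusters_closed_form(n: int) -> int:
    """Expression (8): sum of floor(i / 2) for i = 0, ..., N - 1."""
    return sum(i // 2 for i in range(n))


if __name__ == "__main__":
    for n in (4, 7, 20):
        print(n, clusters_by_processor(n), clusters_closed_form(n))
```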
Figure 4 shows the typical shape of run-tables in the case of $\mathcal{P}$ being the
identity permutation, also locating the 4-used slot clusters.

The following Propositions locate the asymptotic values of $\mu$ and $\varepsilon$:

Proposition 16 $\lim_{k \to \infty} \varepsilon_k = 0$.

PROOF. Let us call U(k) the number of used slots in a run of k processors.
As a consequence of Lemma 15, the number of used slots in a run is

$$U(k) = 2 \sum_{i=0}^{k-1} \left\lfloor \frac{i}{2} \right\rfloor + 2\lambda_k. \qquad (13)$$

From Definition 11 we derive that

$$\varepsilon_k = \frac{U(k)}{(k + 1)\lambda_k}. \qquad (14)$$

Eq. (11) and Eq. (12) show that deg[U(k)] = 2, while from Prop. 14 we know
that deg[(k + 1) · $\lambda_k$] = 3. As a consequence, $\varepsilon_k$ tends to zero as k tends to
infinity. □

Proposition 17 $\lim_{k \to \infty} \mu_k = \frac{8}{3}$.

PROOF. Being

$$\mu_k = \frac{U(k)}{\lambda_k}, \qquad (15)$$

it is possible to derive that

$$\mu_k = \frac{2 \cdot \frac{3}{4}k^2 + 2 \cdot \frac{1}{4}k^2 + \ldots \text{ some 1st degree elements}}{\frac{3}{4}k^2 + \ldots \text{ some 1st degree elements}} = \frac{2k^2 + \ldots \text{ some 1st degree elements}}{\frac{3}{4}k^2 + \ldots \text{ some 1st degree elements}}, \qquad (16)$$

which tends to 8/3, or $2.\overline{6}$, when k goes to infinity. □
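The two limits can be observed numerically by combining Prop. 14 with Eqs. (13), (14), and (15). The Python sketch below does so with exact fractions; the function name `identity_metrics` is an assumption of this example.

```python
from fractions import Fraction
from typing import Tuple


def identity_metrics(k: int) -> Tuple[Fraction, Fraction]:
    """Return (mu_k, eps_k) for the identity permutation as exact fractions."""
    lam = Fraction(3, 4) * k * k + Fraction(5, 4) * k + Fraction(k // 2, 2)  # Prop. 14
    used = 2 * lam + 2 * sum(i // 2 for i in range(k))                        # Eq. (13)
    mu = used / lam                                                           # Eq. (15)
    eps = used / ((k + 1) * lam)                                              # Eq. (14)
    return mu, eps


if __name__ == "__main__":
    for k in (4, 7, 100, 10_000):
        mu, eps = identity_metrics(k)
        print(k, float(mu), float(eps))
```

For k = 4 and k = 7 the sketch returns the averages and efficiencies quoted with Table 1 and Table 2; for growing k the first value approaches 8/3 while the second one vanishes.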
Fig. 4. A graphical representation for run-table 20 when $\mathcal{P}$ is the identity
permutation. Light gray pixels represent wasted slots, gray pixels represent R actions,
black slots are sending actions. Note the black “blocks” which represent the clusters
mentioned in Fig. 2 and Fig. 3.

Table 3
Run-table 5 when $\mathcal{P}$ is chosen pseudo-randomly. $\mu$ is 2.5 slots out of 6, which implies
an efficiency of 41.67%.

4 Second Case: Pseudo-random Permutations 


This Section covers the case such that $\mathcal{P}$ is a pseudo-random^2 permutation
of the integers 0, ..., i − 1, i + 1, ..., N.

Figure 5 shows the values of $\lambda$ using the identity and random permutations
and graphs the parabola which best fits these latter values. We conclude
that, experimentally, the choice of case one is even “worse” than choosing
permutations at random. The same conclusion follows from Fig. 6 and Fig. 7,
which respectively compare the averages and efficiencies in the above two cases.

Table 3 shows run-table 5, and Fig. 8 shows the shape of run-table 20 in this
case.

^2 The standard C function “random” [22] has been used: a non-linear additive
feedback random number generator returning pseudo-random numbers in the range
$[0, 2^{31} - 1]$ with a period approximately equal to $16(2^{31} - 1)$. A truly random integer
has been used as a seed.

Fig. 5. Comparison between lengths in the case of the identity permutation (dotted
parabola) and that of the random permutation (piecewise line), 1 < N < 160.
The lowest curve ($\lambda = 0.71N^2 - 3.88N + 88.91$) is the parabola best fitting
the piecewise line, which suggests a quadratic execution time as in the case of the
identity permutation.

Fig. 6. Comparison between values of $\mu$ in the case of a pseudo-random permutation
(piecewise line) and that of the identity permutation (dotted curve), 1 < N < 160.
Note how the former is strictly above the latter. Note also how $\mu$ seems to tend to a
value right above 2.6 for the identity permutation, as claimed by Prop. 17.

Fig. 7. Comparison between values of $\varepsilon$ in the case of the random permutation
(piecewise line) and that of the identity permutation (dotted curve), 1 < N < 160.
Also in this graph the former is strictly above the latter, though they get closer to
each other and to zero as N increases, as proven for the identity permutation in
Prop. 16.

Fig. 8. A graphical representation for run-table 20 when $\mathcal{P}$ is a pseudo-random
permutation.
5 Third Case: the Algorithm of Pipelined Broadcast 


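Let $\mathcal{P}$ be the following permutation:

$$\begin{pmatrix} 0, & \ldots, & i-1, & i+1, & \ldots, & N \\ i+1, & \ldots, & N, & 0, & \ldots, & i-1 \end{pmatrix}. \qquad (17)$$

Note how permutation (17) is equivalent to i cyclic logical left shifts of the
identity permutation. Note also how, in cycle notation [13], (17) is represented
as one cycle whenever i and N are coprime; for instance,

$$\begin{pmatrix} 0, & 1, & 2, & 4, & 5 \\ 4, & 5, & 0, & 1, & 2 \end{pmatrix},$$

i.e., (17) for N = 5 and i = 3, is equivalent to cycle (0, 4, 1, 5, 2).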
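The example above can be reproduced with a few lines of Python. In this sketch, `pipelined_order` builds the bottom row of permutation (17), and `as_cycle` follows the mapping from the top row to the bottom row starting from the first label; both names are assumptions of this example.

```python
from typing import List


def pipelined_order(i: int, n: int) -> List[int]:
    """Bottom row of permutation (17): i+1, ..., N, 0, ..., i-1."""
    return list(range(i + 1, n + 1)) + list(range(i))


def as_cycle(i: int, n: int) -> List[int]:
    """Cycle through the first top-row label of permutation (17)."""
    top = [j for j in range(n + 1) if j != i]      # identity order with i left out
    image = dict(zip(top, pipelined_order(i, n)))  # top-row label -> bottom-row label
    cycle, x = [], top[0]
    while x not in cycle:
        cycle.append(x)
        x = image[x]
    return cycle


if __name__ == "__main__":
    print(pipelined_order(3, 5))
    print(as_cycle(3, 5))
```

Note that `as_cycle` only returns the cycle containing the first label; it covers all N labels when the permutation consists of a single cycle, as in the example with N = 5 and i = 3.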

A value of $\mathcal{P}$ equal to permutation (17) means that, once processor i has gained
the right to broadcast, it will first send its message to processor i + 1 (possibly
having to wait for it to become available to receive that message), then it will
do the same with processor i + 2, and so forth up to N, then wrapping around
and going from processor 0 to processor i − 1. This is represented in Table 4
for N = 9.

step  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
id 0 S1 S2 S3 S4 S5 S6 S7 S8 S9  - R1 R2 R3 R4 R5 R6 R7 R8 R9  -  -  -  -  -  -  -  -
id 1 R0  ↷ S2 S3 S4 S5 S6 S7 S8 S9 S0  - R2 R3 R4 R5 R6 R7 R8 R9  -  -  -  -  -  -  -
id 2  - R0 R1  ↷ S3 S4 S5 S6 S7 S8 S9 S0 S1  - R3 R4 R5 R6 R7 R8 R9  -  -  -  -  -  -
id 3  -  - R0 R1 R2  ↷ S4 S5 S6 S7 S8 S9 S0 S1 S2  - R4 R5 R6 R7 R8 R9  -  -  -  -  -
id 4  -  -  - R0 R1 R2 R3  ↷ S5 S6 S7 S8 S9 S0 S1 S2 S3  - R5 R6 R7 R8 R9  -  -  -  -
id 5  -  -  -  - R0 R1 R2 R3 R4  ↷ S6 S7 S8 S9 S0 S1 S2 S3 S4  - R6 R7 R8 R9  -  -  -
id 6  -  -  -  -  - R0 R1 R2 R3 R4 R5  ↷ S7 S8 S9 S0 S1 S2 S3 S4 S5  - R7 R8 R9  -  -
id 7  -  -  -  -  -  - R0 R1 R2 R3 R4 R5 R6  ↷ S8 S9 S0 S1 S2 S3 S4 S5 S6  - R8 R9  -
id 8  -  -  -  -  -  -  - R0 R1 R2 R3 R4 R5 R6 R7  ↷ S9 S0 S1 S2 S3 S4 S5 S6 S7  - R9
id 9  -  -  -  -  -  -  -  - R0 R1 R2 R3 R4 R5 R6 R7 R8  ↷ S0 S1 S2 S3 S4 S5 S6 S7 S8
ν     2  2  4  4  6  6  8  8 10  8 10  8 10  8 10  8 10  8 10  8  8  6  6  4  4  2  2

Table 4
Run-table of a run for N = 9 using the permutation of Eq. (17). In this case $\mu$, or the
average utilization, is 6.67 slots out of 10, with an efficiency $\varepsilon$ = 66.67% and a length
$\lambda$ = 27. Note that $\vec{\nu}$ is in this case a palindrome, i.e., as well known [24], a string
like “21012” which can be read indifferently from left to right or vice-versa.


Pictures quite similar to Table 4 can be found in many classical works on
pipelined microprocessors (see e.g. [17, p. 132-133]). Indeed, a pipeline is a
series of data-paths shifted in time so as to overlap their execution, the same
way Eq. (17) tends to overlap as much as possible its broadcast sessions.
Clearly pipe stages are represented here as full processors, and the concept of
machine cycle, or pipe stage time of pipelined processors, simply collapses to
the concept of time step as introduced in Def. 3.

A number of considerations like those above brought us to the name we use
for this special case of our algorithm, as algorithm of “pipelined gossiping.”
We will remark them in the following using the italics typeface.

Clearly using this permutation leads to better performance. In particular, after
a start-up phase (after filling the pipeline), sustained performance is close to
the maximum; a number of unused slots (pipeline bubbles) still exist, even in
the sustained region, but here $\mu$ reaches value N + 1 half of the times (if N is
odd). In the region of decay, starting from time step 19, every new time step
a processor fully completes its task. Similar remarks apply to Table 5; this is
the typical shape of a run-table for N even. This time the state within the
sustained region is more steady, though the maximum number of used slots
never reaches the number of slots in the system.

step  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
id 0 S1 S2 S3 S4 S5 S6 S7 S8  - R1 R2 R3 R4 R5 R6 R7 R8  -  -  -  -  -  -  -
id 1 R0  ↷ S2 S3 S4 S5 S6 S7 S8 S0  - R2 R3 R4 R5 R6 R7 R8  -  -  -  -  -  -
id 2  - R0 R1  ↷ S3 S4 S5 S6 S7 S8 S0 S1  - R3 R4 R5 R6 R7 R8  -  -  -  -  -
id 3  -  - R0 R1 R2  ↷ S4 S5 S6 S7 S8 S0 S1 S2  - R4 R5 R6 R7 R8  -  -  -  -
id 4  -  -  - R0 R1 R2 R3  ↷ S5 S6 S7 S8 S0 S1 S2 S3  - R5 R6 R7 R8  -  -  -
id 5  -  -  -  - R0 R1 R2 R3 R4  ↷ S6 S7 S8 S0 S1 S2 S3 S4  - R6 R7 R8  -  -
id 6  -  -  -  -  - R0 R1 R2 R3 R4 R5  ↷ S7 S8 S0 S1 S2 S3 S4 S5  - R7 R8  -
id 7  -  -  -  -  -  - R0 R1 R2 R3 R4 R5 R6  ↷ S8 S0 S1 S2 S3 S4 S5 S6  - R8
id 8  -  -  -  -  -  -  - R0 R1 R2 R3 R4 R5 R6 R7  ↷ S0 S1 S2 S3 S4 S5 S6 S7
ν     2  2  4  4  6  6  8  8  8  8  8  8  8  8  8  8  8  8  6  6  4  4  2  2

Table 5
Run-table of a run for N = 8 using the permutation of Eq. (17). $\mu$ is equal to 6
slots out of 9, with an efficiency $\varepsilon$ = 66.67% and a length $\lambda$ = 24. Note how $\vec{\nu}$ is a
palindrome string.

Fig. 9. Comparison between run lengths resulting from the identity permutation
(dotted parabola) and those from permutation (17). The former are shown for
1 < N < 160, the latter for 1 < N < 500.

Fig. 10. Comparison between values of $\mu$ derived from the identity permutation
(dotted parabola) and those from permutation (17) for 1 < N < 10.

Fig. 11. Comparison of efficiencies when $\mathcal{P}$ is the identity permutation and in the
case of permutation (17), for 1 < N < 160.

It is possible to show that the distributed algorithm described in Fig. 1, with
$\mathcal{P}$ as in Eq. (17), can be computed in linear time:

Proposition 18 $\lambda_N = 3N$.


PROOF. Let us consider run-table N + 1. Let us strip off its last row; then remove each
occurrence of $S_{N+1}$, shifting each row leftwards by one position. Remove also
each occurrence of $R_{N+1}$. Finally, remove the last column, now empty because
of the previous rules.

Our first goal is showing that what remains is run-table N. To this end, let
us remind the reader that each occurrence of an $S_{N+1}$ action only affects row
N + 1, which has been cut out. Furthermore, each occurrence of $R_{N+1}$ comes
from an S action in row N + 1. Finally, due to the structure of the permutation,
the last action in row N + 1 has to be an $S_N$; as a consequence, row N shall
contain an $R_{N+1}$, and remnant rows shall contain action “-”. Removing the
$R_{N+1}$ allows the last column to be removed as well, with no coherency violation
and no redundant steps. This proves our first claim.

With a reasoning similar to the one followed for Prop. 14 we have that

$$\sigma(k + 1) = \left( \sigma(k) + 3(k + 1) \right) \frac{k + 2}{k + 1}, \qquad (18)$$

that is, by Definition 11,

$$\lambda(k + 1) = \lambda(k) + 3. \qquad (19)$$

Recursive relation (19) represents the first (or forward) difference of $\lambda(k)$ (see
e.g., [16]). The solution of the above is $\lambda(k) = 3k$. □


The efficiency of the algorithm of pipelined gossiping does not depend on N:

Proposition 19 $\forall k > 0 : \varepsilon_k = 2/3$.

PROOF. Again, let U(k) be the number of used slots in a run of k processors.
From Prop. 18 we know that run-table k differs from run-table k + 1 only for
k + 1 “S” actions, k + 1 “R” actions, and the last row consisting of another
k + 1 pairs of useful actions plus some non-useful actions. We conclude that

$$U(k + 1) = U(k) + 4(k + 1). \qquad (20)$$

Via e.g., the method of trial solutions for constant coefficient difference
equations introduced in [16, p. 16], we get to $U(k) = 2k(k + 1)$, which obviously
satisfies recursive relation (20), being $2k(k + 1) + 4(k + 1) = 2(k + 1)(k + 2)$.

So

$$\varepsilon_k = \frac{U(k)}{\sigma(k)} = \frac{2k(k + 1)}{3k(k + 1)} = \frac{2}{3}.$$

□

[Run-table for N = 4 with consecutive gossiping sessions; the cell contents are
not recoverable here. Utilization string: 2 2 4 4 4 ... 4 4 2 2.]

Table 6
The algorithm is modified so that multiple gossiping sessions take place. The central,
best performing area is consequently prolonged. Therein $\varepsilon$ is equal to N/(N + 1).
Note how within that area there are consecutive “zones” of ten columns each, within
which five gossiping sessions reach their conclusion. For instance, such a zone is the
region between columns 7 and 16: therein, at entries (4, 7), (0, 9), (1, 10), (2, 11), and
(3, 12), a processor gets the last value of a broadcast and can perform some work
on a full set of values. This leads to a throughput of t/2, where t is the duration
of a slot.

Proposition 20 $\forall k > 0 : \mu_k = \frac{2}{3}(k + 1)$.

PROOF. The proof follows immediately from

$$\mu_k = U(k)/\lambda_k = 2k(k + 1)/(3k).$$

□
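Propositions 18, 19, and 20 can be cross-checked against the values reported with Table 4 and Table 5. The following Python sketch evaluates the closed forms with exact fractions; the function name `pipelined_metrics` is an assumption of this example.

```python
from fractions import Fraction
from typing import Tuple


def pipelined_metrics(k: int) -> Tuple[int, int, Fraction, Fraction]:
    """Return (lambda_k, U(k), eps_k, mu_k) for permutation (17)."""
    lam = 3 * k                             # Prop. 18
    used = 2 * k * (k + 1)                  # solution of recursive relation (20)
    eps = Fraction(used, (k + 1) * lam)     # used slots over sigma(k), Def. 11
    mu = Fraction(used, lam)                # Prop. 20
    return lam, used, eps, mu


if __name__ == "__main__":
    for k in (8, 9):
        lam, used, eps, mu = pipelined_metrics(k)
        print(k, lam, used, float(eps), float(mu))
```

For k = 9 the sketch yields a length of 27 and an average of 20/3 (about 6.67) slots, and for k = 8 a length of 24 and an average of 6 slots, in agreement with the two tables.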


Table 6 shows how a run-table looks like when multiple gossiping sessions 
take place one after the other. As a result, the central area corresponding to 
the best observable performance is prolonged. In such an area, e has been 
experimentally found to be equal to N/{N + 1) and the throughput, or the 
number of fully completed gossipings per time step, has been found to be equal 
to t/2, t being the duration of a time step. In other words, within that area a 
gossiping is fully completed every two time steps in the average. A number of 
algorithms which are based on multiple gossipings may greatly beneht from 
this approach, e.g., those implemented in the distributed voting algorithm 
described in [3]. 

Obviously our model reaches such a performance only if the system requires
exactly one time step to communicate between any two processors, as, e.g.,
in a crossbar system. This is similar to the constraint of hardware pipelined
processors, which call for a number of memory ports equal to $n$, $n$ being the
number of pipeline stages supported by that machine; in this way the system
is able to overlap any two of its stages. This of course amounts to requiring
a memory system capable of delivering $n$ times the original bandwidth
of a corresponding, unpipelined machine [17].

Fig. 12. A graphical representation for run-table 30 when $\mathcal{P}$ is permutation (17).

Note how the mirror symmetry of graphs like the one of Fig. 12 translates into a
palindromic $\nu$ string.


6 Further Optimizations 


Evidently the execution of wait-in-sending and wait-in-receiving actions (the
“bubbles”) is an impediment to reaching the optimum. As a consequence, a general
strategy to increase the performance of our algorithms could be the following one:

(1) execute a base rule (corresponding to the adoption of any permutation
$\mathcal{P}$, e.g., those presented in §3, §4, or §5), and

(2) as soon as there is a wait-in-sending, choose a different destination
among those that would execute a wait-in-receiving.

In other words, the processor that gains the right to broadcast follows the
order given by $\mathcal{P}$ unless it knows (by calculating its own run-table) that doing
so would trigger a wait-in-sending action. In the latter case, it looks for
another candidate among those following the current one in permutation $\mathcal{P}$.
If there exists at least one such processor that would otherwise be wasting its
slot in a wait-in-receiving, then the message is sent to it. In some sense, this
allows each broadcasting processor to rearrange its leading $\mathcal{P}$ into a “better”
$\mathcal{P}'$, driven by the possibility to (locally) increase the number of used slots.
Of course this local gain might turn into a loss later on. In this Section we
describe how the above procedure perturbs the values of $\lambda$, $\mu$, and $\varepsilon$ for the
three cases discussed so far.

This strategy bears similarities to the optimization method known as
Pipeline Scheduling and described in [17]: for any program $p$ whose run
corresponds to the execution of $IC(p)$ instructions ($IC(p)$ = instruction count of
program $p$), namely

$(i_k)_{1 \le k \le IC(p)}$,    (22)

the detection of an obstacle to the optimum (a stall) triggers an attempt to
go round it by trying to rearrange (22) as

$(i_{j'})_{1 \le j \le IC(p)}$,    (23)

where (23) is substantially a (semantically equivalent) permutation of (22)
such that the obstacle is removed.

Of course in our case semantic equivalence is guaranteed by the fact that
each processor just modifies its permutation on-the-fly. This does not affect
the output of the algorithm, so our task is much easier than if we had to
implement actual pipeline scheduling.

The following algorithm can be used to simulate a run and compute the entire 
broadcasting sequence i.e., the second row of Fig. 1: 

Algorithm 2: Gossiping with permutation scheduling

Input: N, P, p (processor id)
Input: run (running run-table), t (current time step)
Input: m (message to be broadcast)
Output: run, t

begin
    f := TRUE                          { set each entry of f to TRUE }
    i := 1
    w := 0
    while i ≤ N do                     { for each symbol of P }
        j := i + t + w − 1
        { if P_i has never been used and processor P_i is available }
        if f_i = TRUE ∧ run(P_i, j) = FREE then
            f_i := FALSE               { mark P_i as used }
            Send m to processor P_i
            run(p, j) := p S P_i
            run(P_i, j) := P_i R p
            i := i + 1                 { go to next item of P }
        { if P_i has been already used or processor P_i is not available }
        else                           { i.e., when f_i = FALSE ∨ run(P_i, j) ≠ FREE }
            { orderly search for a possible substitute }
            stop := FALSE
            l := 1
            while l ≤ N ∧ stop = FALSE do
                if f_l = TRUE ∧ run(P_l, j) = FREE then
                    stop := TRUE
                else
                    l := l + 1
                endif
            enddo
            { if a candidate has been found at entry l, use P_l instead of P_i }
            if stop = TRUE then
                f_l := FALSE
                Send m to processor P_l
                run(p, j) := p S P_l
                run(P_l, j) := P_l R p
                i := i + 1
            else                       { if no such l exists... }
                run(p, j) := ws        { store a wait-in-sending }
                { deal again with the current value of i, but on next column }
                w := w + 1
            endif
        endif
    enddo
    t := t + N + w
end.
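
The steps of Algorithm 2 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the run-table is modelled as a dictionary from (processor, column) pairs to labels, the label "ws" for a wait-in-sending and the function names are hypothetical, and the rule that processor p + 1 starts broadcasting in the column right after it receives from p is an assumption inferred from Table 7. Like the pseudocode, the sketch does not check whether the sender's own slot is free.

```python
FREE = None          # value of an unused run-table slot
WAIT_SEND = "ws"     # hypothetical label for a wait-in-sending slot


def schedule_broadcast(n, perm, p, run, t):
    """Fill in the slots of processor p's broadcast, starting at column t.

    run maps (processor, column) to an action label; absent keys are FREE.
    Returns t + n + w, as in the last assignment of Algorithm 2.
    """
    unused = [True] * n      # the flags f, 0-based here
    i = 0                    # index into perm; also counts completed sends
    w = 0                    # wait-in-sending slots inserted so far
    while i < n:
        j = t + i + w        # current column
        # Try perm[i] first, then search the permutation from its start.
        chosen = next(
            (l for l in [i, *range(n)]
             if unused[l] and run.get((perm[l], j)) is FREE),
            None,
        )
        if chosen is None:
            run[(p, j)] = WAIT_SEND
            w += 1
        else:
            dest = perm[chosen]
            unused[chosen] = False
            run[(p, j)] = f"S{dest}"
            run[(dest, j)] = f"R{p}"
            i += 1
    return t + n + w


def simulate_identity(n):
    """Run all n + 1 processors with the identity permutation."""
    run, start = {}, 1
    for p in range(n + 1):
        perm = [q for q in range(n + 1) if q != p]
        schedule_broadcast(n, perm, p, run, start)
        if p < n:
            # Assumed base rule: p + 1 starts right after hearing from p.
            start = 1 + next(j for (q, j), a in run.items()
                             if q == p + 1 and a == f"R{p}")
    return run


def used_slots_per_column(run):
    """Count the send and receive slots of every column, left to right."""
    last = max(j for _, j in run)
    return [sum(1 for (_, j), a in run.items() if j == c and a != WAIT_SEND)
            for c in range(1, last + 1)]
```

Calling `used_slots_per_column(simulate_identity(7))` yields one count per column of the simulated run, which can be compared with the last row of a run-table.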


6.1 Applying Algorithm 2 to the Case of the Identity Permutation 


Figures 13, 14, and 15 describe the improvement we observed by applying 
optimizing Algorithm 2 to the case of the identity permutation. 

We experimentally found that the strategy does greatly improve the values of
$\mu$, $\lambda$, and $\varepsilon$. We also observed that a particularly good case occurs when the
number of processors employed is a power of two. Table 7, for instance, shows
run-table 7, which has an efficiency of 73.68%. This efficiency, though, tends
to decrease. In particular we found that, after the case of $N = 2^{10} - 1$, the
efficiency becomes lower than 2/3, i.e., that of the algorithm of pipelined
gossiping (see Table 8).


6.2 Applying Algorithm 2 to the Case of the Pseudo-Random Permutation 


Algorithm 2 also improves performance when applied to the case of the
pseudo-random permutation; this is shown for $1 \le N \le 160$ in Fig. 16, Fig. 17,
and Fig. 18. This time the improvement is not as high as in §6.1.

Fig. 13. This picture portrays and compares run lengths for $1 \le N \le 160$ when $\mathcal{P}$
is the identity permutation (dotted parabola), when Algorithm 2 is applied to the
case of the identity permutation (piecewise line), and in the case of the pipelined
gossiping. Note how Algorithm 2 always improves its base method, and in a small
number of cases ($N = 2^i - 1$, $i \le 10$) it reaches a better performance than that of
pipelined gossiping (see Table 8). This is also shown in Fig. 15.



Fig. 14. Values of $\mu$ for the three cases of Fig. 13.


id \ step   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19
0          S1  S2  S3  S4  S5  S6  S7  R1  R2  R3  R4  R6  R5  R7   -   -   -   -   -
1          R0  S3  S2  S5  S4  S7  S6  S0  R3  R2  R5  R4  R6   -  R7   -   -   -   -
2           -  R0  R1  S3  S6  S4  S5  S7  S0  S1  R3  R7  R4  R5   -   -   -  R6   -
3           -  R1  R0  R2  S7  S5  S4  S6  S1  S0  S2  R5  R7  R4  R6   -   -   -   -
4           -   -   -  R0  R1  R2  R3  S5  S6  S7  S0  S1  S2  S3  R5  R6  R7   -   -
5           -   -   -  R1  R0  R3  R2  R4  S7  S6  S1  S3  S0  S2  S4  R7  R6   -   -
6           -   -   -   -  R2  R0  R1  R3  R4  R5  S7  S0  S1  ws  S3  S4  S5  S2  R7
7           -   -   -   -  R3  R1  R0  R2  R5  R4  R6  S2  S3  S0  S1  S5  S4  ws  S6
ν           2   4   4   6   8   8   8   8   8   8   8   8   8   6   6   4   4   2   2


Table 7

Run-table 7 for $\mathcal{P}$ equal to the identity permutation, modified by Algorithm 2
($\mu = 5.89$, $\varepsilon = 73.68\%$). Entries marked ws are wait-in-sending slots.
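
The figures quoted in the caption of Table 7 can be recomputed from the last row of the table. The sketch below assumes that this row counts the used (send or receive) slots of each column, one digit per column, and that the efficiency is the average number of used slots per column divided by the number of processors; the function name is hypothetical.

```python
def run_table_stats(nu, processors):
    """Summarise a run-table from its nu string (one digit per column)."""
    counts = [int(c) for c in nu]      # used slots in each column
    length = len(counts)               # run length (number of columns)
    used = sum(counts)                 # total number of used slots
    mu = used / length                 # average used slots per column
    return length, used, mu, mu / processors


length, used, mu, eff = run_table_stats("2446888888888664422", 8)
print(length, used, round(mu, 2), round(100 * eff, 2))   # 19 112 5.89 73.68
```

The 112 used slots equal $2N(N+1)$ for $N = 7$, i.e., each of the eight processors performs seven sends and seven receives.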



Fig. 15. Values of $\varepsilon$ for the three cases of Fig. 13. Peak values are in Table 8.


i        1      2      3      4      5      6      7      8      9     10     11
ε (%)  100  85.71  73.68  71.43  69.66  68.11  67.55  67.11  66.88  66.75  65.34


Table 8

$\varepsilon$ values for different values of $N = 2^i - 1$.
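
The point at which the optimized identity case drops below pipelined gossiping can be read off Table 8 mechanically. The following sketch simply encodes the tabulated values (in percent) and looks for the first index below 2/3; the names are hypothetical.

```python
# Efficiency (%) from Table 8, indexed by i, with N = 2**i - 1.
TABLE_8 = {1: 100.0, 2: 85.71, 3: 73.68, 4: 71.43, 5: 69.66, 6: 68.11,
           7: 67.55, 8: 67.11, 9: 66.88, 10: 66.75, 11: 65.34}
PIPELINED = 100 * 2 / 3   # efficiency of pipelined gossiping, in percent


def first_index_below(table, threshold):
    """Smallest i whose tabulated efficiency is below the threshold."""
    return min(i for i, eps in table.items() if eps < threshold)


i = first_index_below(TABLE_8, PIPELINED)
print(i, 2**i - 1)   # 11 2047
```

Up to $i = 10$ the tabulated efficiency stays above 2/3; the first tabulated value below it is the one for $i = 11$.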














Fig. 18. Comparison of the values of $\varepsilon$ in the two cases of Fig. 16.


id \ step   1   2   3   4   5   6   7   8   9  10  11  12
0          S1  S2  S3  S4  R2  R1   -  R3  R4   -   -   -
1          R0  S3  S2  ws  S4  S0  R2  R4  R3   -   -   -
2           -  R0  R1  S3  S0  S4  S1   -   -  R3  R4   -
3           -  R1  R0  R2  ws  ws  S4  S0  S1  S2   -  R4
4           -   -   -  R0  R1  R2  R3  S1  S0  ws  S2  S3
ν           2   4   4   4   4   4   4   4   4   2   2   2


Table 9

Run-table 4 in pipelined gossiping mode and applying Algorithm 2: $\mu = 3.33$ slots
out of 5, or an efficiency of 66.67%. In other words, Algorithm 2 affected the
run-table without bringing any improvement; in particular, the ending order has
changed. Entries marked ws are wait-in-sending slots.


6.3 Applying Algorithm 2 in the Pipelined Broadcast Mode. 


When coupling Algorithm 2 to the algorithm of pipelined gossiping, the local
optimizations gave unstable, and in some cases even negative, returns (see
Fig. 19, 20, and 21). For instance, Table 9 is run-table 4, which shows the
same values of $\mu$ and $\varepsilon$ as if we had performed no optimization at all. $N = 15$
is an example of negative return: in this case, for example, $\varepsilon$ falls to 60%.









Fig. 21. Values of $\varepsilon$ in the two cases of Fig. 19.

7 Applicative Examples 


In this section we provide two example applications for the algorithms
described in this paper: a restoring organ (Sect. 7.1) and a proposal for a Hopfield
neural network architecture (Sect. 7.2).


7.1 The EFTOS Voting Farm 


The EFTOS Voting Farm (VF) is a software component that can be used
to implement restoring organs, i.e., $N$-modular redundancy (NMR) systems
with $N$-replicated voters [12] (see Fig. 22). Basic design goals of such tools
include fault transparency but also replication transparency, a high degree of
flexibility and ease of use, and good performance. Restoring organs make it
possible to overcome the shortcoming of having one voter, the failure of which
leads to the failure of the whole system even when each and every other module is
still running correctly. From the point of view of software engineering, however,
such systems are characterised by two major drawbacks:

• Each module in the NMR must be aware of and responsible for interacting
with the whole set of voters;

• The complexity of these interactions, which is a function that increases
quadratically with $N$ (the cardinality of the set of voters), burdens each
module in the NMR.

Fig. 22. A restoring organ [12], i.e., an $N$-modular redundant system with $N$ voters,
when $N = 3$. Note that a de-multiplexer is required to produce the single final
output.



Fig. 23. Structure of the EFTOS VF for N = 3. 


To overcome these drawbacks, VF adopts a different procedure, as described 
in Fig. 23: in this new procedure, each module only has to interact with, and 
be aware of one voter, regardless of the value of N. 

The VF is an example of an application taking advantage of the algorithms
described in this paper: indeed its voters play the role of the processors of
Sect. 2. In a fully connected and synchronous system, the steady-state
performance of the VF then follows that shown in this paper. In particular this
leads to high scalability and performance.
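
Once a gossiping session completes, every voter holds the same full set of input values and can vote locally. The sketch below illustrates that last step only; the voting function is not specified in this section, so strict-majority voting and the name `majority_vote` are assumptions made for illustration, and the input is assumed to be non-empty.

```python
from collections import Counter


def majority_vote(values):
    """Return the value held by a strict majority of voters, else None."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else None


print(majority_vote([3, 3, 7]))   # 3
print(majority_vote([1, 2, 3]))   # None
```

Because all voters apply the same function to the same set of values, they reach the same result without any further communication.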

A thorough description of the VF can be found in [3].


7.2 Applications to Hopfield Neural Networks


A well-known paradigm of Neural Computing is minimisation [21]. This
cognitive technique has been shown to provide satisfactory solutions
to two classes of problems:


(1) Recognition, where a partial or corrupted pattern is given as input and 
the action of the system network is to recognise it as one of its stored 
patterns. 

(2) Discovery of local minima—a typical example being the travelling sales¬ 
man problem [10]. 


Hopfield networks [9,19] have been found to be particularly useful in solving
the above two classes of problems. A Hopfield network is substantially a net of
binary threshold logic units, connected in an all-to-all pattern, with weighted
connections between units. Weights are changed according to the so-called
Hebb rule, which takes over the role of the training step of, e.g., the multi-layer
perceptron [20]. Given a partial or corrupted input pattern, a Hopfield network
makes it possible to determine which of the data stored in the network most
resembles the input pattern. This is achieved by means of an iterative procedure,
the starting point of which is the input pattern, which consists of serial,
element-by-element updating. This procedure is indeed a gossiping algorithm. When
the number of neurons is large, the adoption of a scalable procedure like the
algorithm of pipelined gossiping could provide a satisfactory solution.
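
The serial updating procedure can be sketched as follows. This is a textbook-style illustration rather than the architecture proposed here: it assumes units with states +1 and -1, a zero threshold, and Hebbian weights with a zero diagonal; the function names are hypothetical. The comment in the inner loop marks where each unit needs the current value of every other unit, which is the all-to-all exchange that a gossiping service would provide.

```python
def train_hebb(patterns):
    """Hebb rule: w[i][j] = sum over patterns of x[i] * x[j], zero diagonal."""
    n = len(patterns[0])
    return [[0 if i == j else sum(x[i] * x[j] for x in patterns)
             for j in range(n)] for i in range(n)]


def recall(weights, state, max_sweeps=10):
    """Serially update one unit at a time until a full sweep changes nothing."""
    state = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i, row in enumerate(weights):
            # Unit i needs the current value of every other unit.
            field = sum(w * s for w, s in zip(row, state))
            new = 1 if field >= 0 else -1
            if new != state[i]:
                state[i], changed = new, True
        if not changed:
            break
    return state


stored = [1, -1, 1, -1, 1, 1]
weights = train_hebb([stored])
noisy = [1, -1, -1, -1, 1, 1]          # third element corrupted
print(recall(weights, noisy) == stored)   # True
```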


8 Conclusions and Future Work


As, e.g., in [2], a formal model for a family of algorithms depending on a
combinatorial parameter, $\mathcal{P}$, has been introduced and discussed. Several case studies
have been designed, simulated, and analyzed, also categorizing in some cases
their asymptotic behaviour. In one of these cases, the algorithm of pipelined
gossiping, it has been proved that the efficiency of the algorithm does not
depend on $N$, a result that improves on those of all the known gossiping
algorithms [14,11]. An optimizing algorithm has been presented and discussed as
well.

We experimentally found that the efficiencies of the base cases, improved via
the optimization algorithm, lie in general quite “close” to the efficiency of the
algorithm of pipelined gossiping. In particular we found that:

• The simplest base case, leading to the worst observed performance, is the 
case which best matches the optimizing algorithm 2. Combining this worst 
case with the optimization actually leads to a great improvement which, for 
some values of N, raises performance even above the values of the “best” 
base case. 

• Nearly no improvement comes when trying to optimize the “best” case. 

The two above observations seem to suggest that, from a certain value of $N$
onward, the algorithm of pipelined gossiping actually is the “best” member
of the family presented herein. This is suggested, for instance, by Fig. 15 and
Table 8, which show how even in the best cases, corresponding to a number of
processors equal to a power of 2, there is experimental evidence that, sooner
or later, the complexity of the problem brings efficiency below 2/3, that of
pipelined gossiping. Fig. 15 also shows that optimizing the best case
does not, in general, improve on the best case without optimization. This brought
us to the following Conjecture:

Sooner or later, efficiency reaches a value less than or equal to the one of 
pipelined gossiping: 

Conjecture 21 For any $f$ (function transforming parameters like $\mathcal{P}$), let us
call $\varepsilon_{f,k}$ the efficiency of $f$ in a run with $N = k$ and $\varepsilon_{PB} = 2/3$ the efficiency of
the algorithm of pipelined gossiping; then there exists an integer $m$ such that

$\forall n > m : \varepsilon_{f,n} \le \varepsilon_{PB}$.

Investigating the above Conjecture will be part of future work.

Of course the optimizing Algorithm 2 is not the only one nor the best possible
one. On the contrary, it is characterized by an optimization policy which only
takes into account the local gain of the current processor, without any reference
to possible global optimization strategies (e.g., considering also the scenarios
that the rest of the processors are going to face because of the current local
choice). Techniques based on trying alternative solutions and choosing the
best one, possibly considering future consequences of current, local decisions,
may prove more appropriate and better performing, and may be used
to validate the considerations that brought us to Conjecture 21.

This paper introduced a family of algorithms depending on a combinatorial
parameter and showed that an optimum exists for its performance in a special
case: fully connected and synchronous systems. Such an optimum
may also exist in other cases, an open question that may be dealt with in
future work. Should such optima exist, then any tool using our algorithms
could adapt to a change in the communication infrastructure by simply “loading”
the new optimum. This may have positive repercussions on the optimal porting of
gossiping services or on mobile systems using gossiping.